Abstract

Graph edit distance (GED) is a powerful and flexible graph matching 
paradigm that can be used to address different tasks in structural pattern 
recognition, machine learning, and data mining. In this paper, some new 
binary linear programming formulations for computing the exact GED 
between two graphs are proposed. A major strength of the formulations 
lies in their genericity since the GED can be computed between directed 
or undirected fully attributed graphs (i.e. with attributes on both vertices 
and edges). Moreover, a relaxation of the domain constraints in the formulations
provides efficient lower bound approximations of the GED. A
complete experimental study comparing the proposed formulations with 4 
state-of-the-art algorithms for exact and approximate graph edit distances 
is provided. By considering both the quality of the proposed solution and 
the efficiency of the algorithms as performance criteria, the results show 
that none of the compared methods dominates the others in the Pareto 
sense. As a consequence, when faced with a given real-world problem, a trade-off
between quality and efficiency has to be chosen w.r.t. the application 
constraints. In this context, this paper provides a guide that can be used 
to choose the appropriate method. 


1 Introduction 

Graphs are data structures able to describe complex entities through their elementary
components (the vertices of the graph) and the relational properties
between them (the edges of the graph). For attributed graphs, both vertices
and edges can be characterized by attributes that can vary from nominal labels
to more complex descriptions such as strings or feature vectors, leading
to very powerful representations. As a consequence of their inherent genericity
and their ability to represent objects as compositions of elementary entities,
and thanks to the general improvement of computing power, graph representations
have become more and more popular in many application domains such
as computer vision, image understanding, biology, chemistry, text processing or
pattern recognition. With this emergence of the use of graphs, new algorithmic
issues have arisen, such as graph mining [1], graph clustering [2] or graph
classification [3].

A major issue related to the graph-based algorithms mentioned above is the 
computation of a (dis)similarity measure between two graphs. A huge number 
of algorithms have been proposed in the literature to solve this problem, which 
is particularly crucial for machine learning issues. They can be categorized as
embedding-based methods vs. matching-based methods.

In embedding-based methods, the key idea is to project the input graphs to be
compared into a vector space in order to benefit from the distance computation
designed for vectorial representations. Among existing approaches, some are
based on an implicit projection, through the use of graph kernels 110 whereas
other methods make the projection explicit, through the computation of a feature
vector for each graph to be compared. The features can result for example
from frequencies of appearance of specific sub-structures 0H or from a spectral
analysis of the graphs 00. Embedding-based methods are generally computationally
effective since they do not involve a complete matching process. On
the other hand, they do not take into account the complete relational properties 
and do not provide the matching between vertices and edges. 

A second way to compute the dissimilarity between two graphs consists in
using matching-based methods. In such a case, computing the similarity between
two graphs requires the computation and the quantification of the "best"
matching between them. Different kinds of matching algorithms have been used
for such a computation. They differ according to the kind of constraints that 
must be respected and to those that can be relaxed. As an example, maximum 
common subgraph and/or minimum common supergraphs have been used in 
mm to derive a graph distance metric. Since exact isomorphisms rarely occur
in pattern analysis applications, another interesting class of matching problems
for similarity evaluation is the error-tolerant graph matching problem. A graph
matching is said to be error-tolerant when the matching tolerates differences 
on the topology and/or the attributes of the vertices and the edges. Adjacency 
matrix eigendecomposition m or graduated assignment methods (HE] are 
examples of methods that have been used to tackle this problem. Another well 
known error-tolerant matching-based method that can be used to compute a 
dissimilarity measure between two graphs is the graph edit distance (GED). In 
this method, the graph matching process and the dissimilarity computation are 
linked through the introduction of a set of graph edit operations (e.g. node 
insertion, node deletion). Each edit operation is characterized by a cost, and 
the graph edit distance is the total cost of the least expensive sequence of edit 
operations that transforms one graph into the other one. A major advantage of 
graph edit distance is that it is a dissimilarity measure for arbitrarily structured 
and arbitrarily attributed graphs. In contrast with other approaches, it does not 
suffer from any restrictions and can be applied to any type of graph, including 
hypergraphs m- Graph edit distance has been used to address various graph 
classification problems cairn Eg. However, a main drawback of graph edit 
distance is its computational complexity which is exponential in the number of 
nodes of the involved graphs. Consequently, computation of graph edit distance
is feasible for graphs of rather small size only. In order to overcome this restriction,
a number of fast but suboptimal methods have been proposed in the
literature (e.g. HH ESI HU HH1231 EH). On the other hand, only a few optimal
methods have been proposed to postpone the graph size restriction [25], [26].

This paper tackles the problem of graph edit distance computation by proposing
two main contributions. The first one consists in solving the graph edit
distance problem with binary linear programming. More precisely, two original
exact formulations of the GED are provided. They are very general, since they
are able to compute the GED between directed or undirected fully attributed
graphs (i.e. with attributes on both vertices and edges). Furthermore, a relaxation
of the domain constraints in the formulations provides efficient lower
bound approximations of the GED. On the basis of these formulations, the second
contribution is a very complete comparative study where eight algorithms
for exact and approximate graph edit distances are compared on a set of graph
datasets. By considering both the quality of the proposed solution and the efficiency
of the algorithms, we show that none of the compared methods dominates
the others in the Pareto sense. As a consequence, when faced with a given real-world
problem, a trade-off between quality and efficiency has to be chosen w.r.t. the
application constraints. In this context, this paper provides a guide that can be
used to choose the appropriate method.

This paper is organized as follows: Section 2 presents the important definitions
necessary for introducing our formulations of the GED. Then, Section 3
reviews existing approaches for computing the GED with exact and inexact methods.
Section 4 describes the proposed binary linear programming formulations.
Section 5 presents the experiments and analyses the obtained results. Section 6 
provides some concluding remarks. 


2 Problem statement 

In this paper, we are interested in computing the graph edit distance between 
attributed graphs. 

Definition 1. An attributed graph $G$ is a 4-tuple $G = (V, E, \mu, \xi)$, where:

• $V$ is a set of vertices,

• $E$ is a set of edges, such that $\forall e = (i, j) \in E$, $i \in V$ and $j \in V$,

• $\mu : V \to L_V$ is a vertex labeling function which associates the label $\mu(v)$
to all vertices $v$ of $V$, where $L_V$ is the set of possible labels for the vertices,

• $\xi : E \to L_E$ is an edge labeling function which associates the label $\xi(e)$ to
all edges $e$ of $E$, where $L_E$ is the set of possible labels for the edges.

The vertex (resp. edge) label space $L_V$ (resp. $L_E$) may be composed of
any combination of numeric, symbolic or string attributes.
A graph $G$ is said to be simple if it has no loop (an edge that connects a vertex to
itself) and no multiedge (several edges between the same vertices). In this case,
$E \subseteq \{(i, j) \in V \times V \mid i \neq j\}$ and an edge can be unambiguously designated by
the pair of vertices it connects. Otherwise, $G$ is a multigraph and $E$ is a multiset.
A graph $G$ is said to be undirected if the relation $E$ is symmetric, i.e. if its edges have
no orientation. In this case, $\forall (i, j) \in E$, $(j, i) \in E$ and $(i, j) = (j, i)$. Otherwise,
$G$ is a directed graph.

Definition 1 allows us to handle arbitrarily structured graphs (directed or
undirected, simple graphs or multigraphs) with unconstrained labeling functions.
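As an illustration of Definition 1, the following is a minimal sketch of a simple directed attributed graph in Python. The class name `AttributedGraph` and its methods are hypothetical and chosen for this example only; labels are left untyped so that numeric, symbolic or string attributes can all be stored.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Hashable, Tuple

Vertex = Hashable
Edge = Tuple[Vertex, Vertex]  # (i, j): edge going from i to j


@dataclass
class AttributedGraph:
    """Simple directed attributed graph G = (V, E, mu, xi) (illustrative only)."""
    mu: Dict[Vertex, Any] = field(default_factory=dict)  # vertex labeling function
    xi: Dict[Edge, Any] = field(default_factory=dict)    # edge labeling function

    @property
    def V(self):
        return set(self.mu)

    @property
    def E(self):
        return set(self.xi)

    def add_vertex(self, v, label=None):
        self.mu[v] = label

    def add_edge(self, i, j, label=None):
        # Definition 1 requires both endpoints of an edge to belong to V.
        if i not in self.mu or j not in self.mu:
            raise ValueError("both endpoints must be vertices of the graph")
        if i == j:
            raise ValueError("loops are not allowed in a simple graph")
        self.xi[(i, j)] = label

    def is_undirected(self):
        # E is symmetric and both orientations carry the same label.
        return all((j, i) in self.xi and self.xi[(j, i)] == label
                   for (i, j), label in self.xi.items())


g = AttributedGraph()
g.add_vertex("a", label="C")
g.add_vertex("b", label="O")
g.add_edge("a", "b", label=2.0)
```

Storing the two labeling functions as dictionaries keeps $V$ and $E$ implicit (they are the key sets), which is enough for simple graphs; a multigraph would need a different edge container.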

Many applications using graph-based representations need to evaluate how 
two graphs are similar, or how they differ. The graph edit distance is commonly 
used to measure the dissimilarity between two graphs. Graph edit distance is 
an error-tolerant graph matching method. It defines the dissimilarity of two 
graphs by the minimum amount of distortion that is needed to transform one 
graph into another [Si- 


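Definition 2. The graph edit distance $d(\cdot, \cdot)$ is a function

\[
\begin{aligned}
d : \mathcal{G} \times \mathcal{G} &\to \mathbb{R}^+ \\
(G_1, G_2) &\mapsto d(G_1, G_2) = \min_{o = (o_1, \dots, o_k) \in \Gamma(G_1, G_2)} \sum_{i=1}^{k} c(o_i)
\end{aligned}
\]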

where $G_1 = (V_1, E_1, \mu_1, \xi_1)$ and $G_2 = (V_2, E_2, \mu_2, \xi_2)$ are two graphs from the set
$\mathcal{G}$ and $\Gamma(G_1, G_2)$ is the set of all edit paths $o = (o_1, \dots, o_k)$ that transform
$G_1$ into $G_2$. An elementary edit operation $o_i$ is one of vertex substitution
($v_1 \to v_2$), edge substitution ($e_1 \to e_2$), vertex deletion ($v_1 \to \epsilon$), edge deletion
($e_1 \to \epsilon$), vertex insertion ($\epsilon \to v_2$) and edge insertion ($\epsilon \to e_2$), with $v_1 \in V_1$,
$v_2 \in V_2$, $e_1 \in E_1$ and $e_2 \in E_2$. $\epsilon$ is a dummy vertex or edge which is used to
model insertion or deletion. $c(\cdot)$ is a cost function on elementary edit operations
$o_i$ that satisfies

• $c(v_1 \to v_2) \leq c(v_1 \to v) + c(v \to v_2)$

• $c(e_1 \to e_2) \leq c(e_1 \to e) + c(e \to e_2)$

• $c(v_1 \to \epsilon) \leq c(v_1 \to v) + c(v \to \epsilon)$

• $c(e_1 \to \epsilon) \leq c(e_1 \to e) + c(e \to \epsilon)$

• $c(\epsilon \to v_2) \leq c(\epsilon \to v) + c(v \to v_2)$

• $c(\epsilon \to e_2) \leq c(\epsilon \to e) + c(e \to e_2)$

Moreover, in order to guarantee the symmetry property ($d(G_1, G_2) = d(G_2, G_1)$),
the reverse edit path should result in the same cost. So, these costs have
to be defined in a symmetric manner so that $c(v_1 \to v_2) = c(v_2 \to v_1)$,
$c(e_1 \to e_2) = c(e_2 \to e_1)$, $c(v \to \epsilon) = c(\epsilon \to v)$ and $c(e \to \epsilon) = c(\epsilon \to e)$.

When the graph edit distance is computed between unlabeled graphs, the
identity property ($d(G_1, G_2) = 0 \Leftrightarrow G_1 = G_2$) imposes that the substitution
costs are equal to 0. The insertion and deletion costs are then set to a constant.
In the more general case where the graph edit distance is computed between 
attributed graphs, edit costs are generally defined as functions of vertices (resp. 
edges) labels. More precisely, substitution costs are defined as a function of the 
labels of the substituted vertices (resp. edges), whereas insertion and deletion 
are penalized with a value linked to the label of the inserted/deleted vertex 
(resp. edge). 


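\[
\begin{aligned}
c(v_1 \to v_2) &= c(v_2 \to v_1) = f_v(\mu_1(v_1), \mu_2(v_2)) \\
c(e_1 \to e_2) &= c(e_2 \to e_1) = f_e(\xi_1(e_1), \xi_2(e_2)) \\
c(v \to \epsilon) &= c(\epsilon \to v) = g_v(\mu(v)) \\
c(e \to \epsilon) &= c(\epsilon \to e) = g_e(\xi(e))
\end{aligned}
\]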
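To make the role of $f_v$, $f_e$, $g_v$ and $g_e$ concrete, here is a minimal sketch of one possible choice of cost functions. It assumes, purely for illustration, that vertex labels are numeric vectors and edge labels are scalars; the Euclidean substitution cost and the constants `TAU_V` and `TAU_E` are hypothetical choices, not values prescribed by the text.

```python
import math

TAU_V = 1.0  # hypothetical constant cost for inserting/deleting a vertex
TAU_E = 0.5  # hypothetical constant cost for inserting/deleting an edge


def f_v(label_1, label_2):
    """Vertex substitution cost: Euclidean distance between label vectors."""
    return math.dist(label_1, label_2)


def f_e(label_1, label_2):
    """Edge substitution cost: absolute difference between scalar labels."""
    return abs(label_1 - label_2)


def g_v(label):
    """Vertex insertion/deletion cost (constant here, could depend on label)."""
    return TAU_V


def g_e(label):
    """Edge insertion/deletion cost (constant here, could depend on label)."""
    return TAU_E
```

Both substitution costs are symmetric in their arguments and insertion and deletion share the same function, so the symmetry conditions stated above hold by construction for this particular choice.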

3 Related work 

The graph edit distance, which is the minimum cost associated to an error-correcting
graph matching, has been the subject of many studies in the literature.
Several papers propose surveys of these works |271 1231122], They distinguish 
exact approaches from approximations. Indeed, as stated in [30], the graph edit
distance problem is NP-hard. It is then prohibitively difficult to compute the 
graph edit distance for large graphs, and the literature reports exact methods to 
compute GED only for small graphs, while approximations by means of upper 
and lower bounds computation are often used for larger graphs. 

3.1 Exact approaches 

A first family of exact computation of the graph edit distance is based on the 
widely known A* algorithm. This algorithm relies on the exploration of the 
tree of solutions. In this tree, each node corresponds to a partial edition of the 
graph. A leaf of the tree corresponds to an edit path which transforms one of 
the input graphs into the other one. The exploration of the tree is guided by 
developing most promising ways on the basis of an estimation of the graph edit 
distance. For each node, this estimation is the sum of the cost associated to the 
partial edit path and an estimation of the cost for the remaining path, the latter 
being given by a heuristic. Provided that the estimation of the future cost is 
lower than or equal to the real cost, an optimal path from the root node to a 
leaf node is guaranteed to be found m- a simple way to fulfill this constraint 
would be to set the estimation of the future cost to zero, but this may lead to 
explore the whole tree of solutions. Indeed, the smaller the difference between 
the estimation and the real future cost, the fewer nodes will be expanded by the 
A* algorithm. However, the other extreme which consists in computing the real 
cost for the remaining edit path would require an exponential time. The different 
A*-based methods published in the literature mainly differ in the implemented
heuristics for the future cost estimation, which correspond to different trade-offs
between approximation quality and computation time [25], [32].

In another family of algorithms, the graph edit distance is computed by
solving a binary linear program. Almohamad and Duffuaa [33] propose a binary
linear programming formulation of the weighted graph matching problem
which aims at determining the permutation matrix minimizing the $L_1$ norm of
the difference between the adjacency matrix of the input graph and the permuted
adjacency matrix of the target one. Later, Justice and Hero [34] also proposed
a BLP formulation of the graph edit distance problem aiming at determining
the permutation matrix which minimizes the cost of transforming $G_1$ into $G_2$,
with $G_1$ and $G_2$ two unweighted and undirected graphs. The criterion to be
minimized (see eq. (1)) takes into account costs for matching vertices, but the
formulation does not integrate the ability to process graphs that carry labels on
their edges.


\[
d(G_1, G_2) = \sum_{i=1}^{n} \sum_{j=1}^{n} C_{i,j} P_{i,j} + \frac{1}{2} \left\| A_1 - P A_2 P^T \right\|_1 \tag{1}
\]

where $C_{i,j}$ is the cost for matching the $i$-th vertex in $G_1$ and the $j$-th vertex in
$G_2$. $A_1$ (resp. $A_2$) is the adjacency matrix of $G_1$ (resp. $G_2$), and $P$ is an
orthogonal permutation matrix such that $P P^T = P^T P = I$. A mathematical
transformation is used to transform this non-linear optimization problem into a
linear one. The modeling of graphs by means of adjacency matrices restricts the
formulation to the processing of simple graphs.

3.2 Approximations 

Considering that exact computation of graph edit distance can be performed in
a reasonable time only for small graphs, many researchers have focused their
effort on the computation of approximations in polynomial time. For example, in
their paper [34], Justice and Hero have proposed a lower bound of the graph edit
distance which can be computed in $O(n^7)$ by extending the domain of variables
in $P$ from $\{0, 1\}$ to $[0, 1]$. In the same paper, they also proposed an upper bound
that can be computed in $O(n^3)$ by determining vertex correspondence based
only on the vertex term of eq. (1) thanks to the Hungarian method (also called
Munkres assignment algorithm). The remaining part of the cost is deduced from
the permutation matrix determined in the previous step. In a quite similar way,
Riesen et al. m propose to first exploit a cost matrix for vertex substitution,
insertion or deletion in order to determine the vertex assignment thanks to the
Munkres algorithm with a complexity of $O((n_1 + n_2)^3)$ in the number of nodes
$n_1 = |V_1|$ and $n_2 = |V_2|$ of the involved graphs. The vertex assignment is then
used to infer an edit path which transforms one graph into the other and whose
associated cost is an upper bound of the graph edit distance.
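The vertex assignment step described above can be sketched as follows. The block layout of the $(n_1 + n_2) \times (n_1 + n_2)$ cost matrix (substitutions in the top-left block, deletions and insertions on the diagonals of the off-diagonal blocks) is an assumption, since the text does not spell it out, and the exhaustive search over permutations is only a stand-in for the Munkres algorithm that is usable on toy instances. Function names are hypothetical.

```python
from itertools import permutations

INF = float("inf")


def bipartite_cost_matrix(V1, V2, c_sub, c_del, c_ins):
    """Square cost matrix for vertex substitution, deletion and insertion."""
    n1, n2 = len(V1), len(V2)
    size = n1 + n2
    C = [[0.0] * size for _ in range(size)]
    for r in range(size):
        for s in range(size):
            if r < n1 and s < n2:      # substitute V1[r] with V2[s]
                C[r][s] = c_sub(V1[r], V2[s])
            elif r < n1:               # delete V1[r] (diagonal entries only)
                C[r][s] = c_del(V1[r]) if s - n2 == r else INF
            elif s < n2:               # insert V2[s] (diagonal entries only)
                C[r][s] = c_ins(V2[s]) if r - n1 == s else INF
            # else: dummy-to-dummy assignment, cost 0
    return C


def best_assignment(C):
    """Exhaustive stand-in for the Munkres algorithm (toy sizes only)."""
    size = len(C)
    best = min(permutations(range(size)),
               key=lambda p: sum(C[r][p[r]] for r in range(size)))
    return best, sum(C[r][best[r]] for r in range(size))


# Toy instance: vertices are identified with scalar labels.
V1, V2 = [0, 10], [1]
C = bipartite_cost_matrix(V1, V2,
                          c_sub=lambda a, b: abs(a - b),
                          c_del=lambda a: 2.0,
                          c_ins=lambda b: 2.0)
assignment, cost = best_assignment(C)
```

In this toy instance the cheapest assignment substitutes vertex 0 with vertex 1 and deletes vertex 10. Only the vertex part of the cost is computed here; deriving the full edit path and its edge costs from the assignment is left out.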

In [22], Neuhaus et al. propose other approximations derived from the A*
algorithm. The first one, called A*-BEAMSEARCH, prunes the tree of
solutions by limiting the number of concurrent partial solutions to the $q$ most
promising ones. At the end of the algorithm, a valid edit path and its associated
cost are provided, but there is no guarantee that it corresponds to the optimal
one, since the latter may have been eliminated in earlier steps of the algorithm.
The parameter $q$, corresponding to the number of concurrent partial solutions
to keep, controls the trade-off between combinatorial cost and quality
of the approximation. This method provides an upper bound of the exact graph
edit distance. In the same paper, a method called A*-PATHLENGTH proposes to
speed up the access to a leaf node in the tree of solutions by giving a higher
exploration priority to long partial edit paths. This strategy is motivated by
the observation that first assignments are the most computationally expensive
and that they are rarely called into question.

More recently, in [55] , the vertex assignment computed by means of bipartite 
graph matching is used as an initialization step for a genetic algorithm which 
attempts to improve the quality of the approximation. Indeed, from any vertex 
assignment, it is possible to derive an edit path and finally compute its cost m- 
The vertex assignment which is optimal in terms of vertex substitution is not always
optimal for the whole edit path. However, it has been observed that it may
only differ by a few assignments. In the proposed genetic algorithm, population
individuals correspond to different vertex mappings. The initial population
is generated by deriving mappings that are mutated versions of the one that
has been determined by the Hungarian algorithm. The probability of a vertex
mapping to be selected is linked to the vertex substitution cost. The lower the
corresponding edit distance, the better the individual fits the objective function.
The genetic algorithm iterates by selecting and mixing several mappings. 

Fischer et al. [55] propose to integrate in the A* algorithm a heuristic based
on a modified Hausdorff distance. Given two graphs $G_1$ and $G_2$ and $C$ a cost
matrix for vertex substitution¹, the Hausdorff Edit Distance is defined by

\[
HED(G_1, G_2, C) = \sum_{u \in V_1} \min_{v \in V_2 \cup \{\epsilon\}} C(u, v) + \sum_{v \in V_2} \min_{u \in V_1 \cup \{\epsilon\}} C(v, u)
\]

which can be interpreted as the sum of distances to the most similar vertex in
the other graph. This distance is computed in a time complexity of $O(n_1 \cdot n_2)$.
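The definition above translates almost literally into code. The sketch below follows the formula as written in this section: matching a vertex with $\epsilon$ is charged its deletion cost (for $V_1$) or insertion cost (for $V_2$). The function and parameter names are hypothetical.

```python
def hausdorff_edit_distance(V1, V2, sub_cost, del_cost, ins_cost):
    """Sum, over both graphs, of the cost to the most similar vertex.

    sub_cost(u, v): substitution cost between a vertex of V1 and one of V2
    del_cost(u):    cost of matching u in V1 with the dummy vertex epsilon
    ins_cost(v):    cost of matching v in V2 with the dummy vertex epsilon
    """
    total = 0.0
    for u in V1:
        # min over v in V2 ∪ {epsilon}
        total += min([sub_cost(u, v) for v in V2] + [del_cost(u)])
    for v in V2:
        # min over u in V1 ∪ {epsilon}
        total += min([sub_cost(u, v) for u in V1] + [ins_cost(v)])
    return total


# Toy instance: vertices are identified with scalar labels.
hed = hausdorff_edit_distance(
    V1=[0, 10], V2=[1],
    sub_cost=lambda u, v: abs(u - v),
    del_cost=lambda u: 2.0,
    ins_cost=lambda v: 2.0,
)
```

The two nested loops make the $O(n_1 \cdot n_2)$ complexity visible. Note that, unlike an edit path, several vertices of one graph may pick the same nearest vertex in the other graph.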

Graph edit distance approximations have also been proposed in a probabilistic
framework [1551 [36] where the objective is to find the vertex assignment
that maximizes the a posteriori probability considering vertex attributes. However,
unlike the methods formerly presented, the corresponding heuristics are
unbounded and cannot be exploited by branch and bound algorithms to prune
the tree of solutions or to efficiently prioritize its exploration in the A* algorithm.

¹ It also integrates vertex insertion and deletion costs.
4 Graph edit distance using binary linear programming

In this article, the graph edit distance problem is modeled by a Binary Linear 
Program (BLP). A BLP is a restriction of integer linear programming (ILP) 
where the variables are binary. Hence, its general form is as follows: 

\[
\begin{aligned}
\min_{x} \quad & c^T x && \text{(2a)} \\
\text{subject to} \quad & A x \leq b && \text{(2b)} \\
& x \in \{0, 1\}^n && \text{(2c)}
\end{aligned}
\]

where $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$ are data of the problem. A solution of this
optimization problem is a vector $x$ of $n$ binary variables. $A$ is used to express
linear inequality constraints (2b). If the program is feasible, i.e. if it has such
solutions, then the optimal solution is the one that minimizes the objective
function (2a) and respects constraints (2b) and (2c). The objective function
$c^T x$ is a linear combination of variables of $x$ weighted by the components of the
vector $c$.
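The general form (2a)-(2c) can be illustrated on a toy instance by enumerating all $2^n$ binary vectors. This exhaustive search is only a sketch meant to show what a BLP is; it is not the solving strategy used for the formulations of this paper, and the function name is hypothetical.

```python
from itertools import product


def solve_blp_brute_force(c, A, b):
    """Return (x, value) minimising c^T x subject to A x <= b, x in {0,1}^n.

    Returns (None, None) if the program is infeasible.
    """
    n = len(c)
    best_x, best_value = None, None
    for x in product((0, 1), repeat=n):
        # Constraint (2b): every row of A must satisfy A_row . x <= b_row.
        feasible = all(
            sum(a_j * x_j for a_j, x_j in zip(row, x)) <= b_i
            for row, b_i in zip(A, b)
        )
        if not feasible:
            continue
        value = sum(c_j * x_j for c_j, x_j in zip(c, x))  # objective (2a)
        if best_value is None or value < best_value:
            best_x, best_value = x, value
    return best_x, best_value


# min -x0 - 2*x1 - 3*x2  s.t.  x0 + x1 + x2 <= 2  and  x1 + x2 <= 1
x_opt, v_opt = solve_blp_brute_force(
    c=[-1, -2, -3],
    A=[[1, 1, 1], [0, 1, 1]],
    b=[2, 1],
)
```

Here the optimum selects the first and third variables, since the second constraint forbids taking both of the two most profitable ones.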

In this section, we present the two formulations we wrote for the GED. 
Then, we present how the formulations are solved. Finally, we discuss how the 
relaxation of the formulations can provide a lower bound of the GED. 


4.1 Modelling the GED problem 

In this subsection, we first define in 4.1.1 the variables used for formulating the
GED as a BLP. Then, we describe in 4.1.2 the objective function of the program
and in 4.1.3 the linear constraints that must be satisfied to correctly match the
two graphs.

4.1.1 Variable and cost functions definitions 

Our goal is to compute the graph edit distance between two graphs $G_1 =
(V_1, E_1, \mu_1, \xi_1)$ and $G_2 = (V_2, E_2, \mu_2, \xi_2)$. In the rest of this section, for the
sake of simplicity of notations, we consider that the graphs $G_1$ and $G_2$ are simple
directed graphs. However, let us emphasize that the formulations given in
this section can be applied without modification to multigraphs, and that the
undirected case only needs some slight modifications (please refer to the appendix).

In the GED definition provided in Section 2, the edit operations that are
allowed to match the graphs $G_1$ and $G_2$ are (i) the substitution of the label of
a vertex (respectively an edge) of $G_1$ with the label of a vertex (resp. an edge)
of $G_2$, (ii) the deletion of a vertex (or an edge) from $G_1$ and (iii) the insertion
of a vertex (or an edge) of $G_2$ in $G_1$. For each type of edit operation, we define
a set of corresponding binary variables:












• $\forall (i, k) \in V_1 \times V_2$, $x_{i,k} = 1$ if $i$ is substituted with $k$, and $0$ otherwise.

• $\forall (ij, kl) \in E_1 \times E_2$, $y_{ij,kl} = 1$ if $ij$ is substituted with $kl$, and $0$ otherwise.

• $\forall i \in V_1$, $u_i = 1$ if $i$ is deleted from $G_1$, and $0$ otherwise.

• $\forall ij \in E_1$, $e_{ij} = 1$ if $ij$ is deleted from $G_1$, and $0$ otherwise.

• $\forall k \in V_2$, $v_k = 1$ if $k$ is inserted in $G_1$, and $0$ otherwise.

• $\forall kl \in E_2$, $f_{kl} = 1$ if $kl$ is inserted in $G_1$, and $0$ otherwise.

Using these notations, we define an edit path between $G_1$ and $G_2$ as a
6-tuple $(x, y, u, v, e, f)$ where $x = (x_{i,k})_{(i,k) \in V_1 \times V_2}$, $y = (y_{ij,kl})_{(ij,kl) \in E_1 \times E_2}$,
$u = (u_i)_{i \in V_1}$, $e = (e_{ij})_{ij \in E_1}$, $v = (v_k)_{k \in V_2}$ and $f = (f_{kl})_{kl \in E_2}$.

In order to evaluate the global cost of an edit path, elementary costs for 
each edit operation must be defined. We adopt the following notations for these 
costs: 


• $\forall (i, k) \in V_1 \times V_2$, $c(i \to k)$ is the cost of substituting the vertex $i$ with $k$,

• $\forall (ij, kl) \in E_1 \times E_2$, $c(ij \to kl)$ is the cost of substituting the edge $ij$ with
$kl$,

• $\forall i \in V_1$, $c(i \to \epsilon)$ is the cost of deleting the vertex $i$ from $G_1$,

• $\forall ij \in E_1$, $c(ij \to \epsilon)$ is the cost of deleting the edge $ij$ from $G_1$,

• $\forall k \in V_2$, $c(\epsilon \to k)$ is the cost of inserting the vertex $k$ in $G_1$,

• $\forall kl \in E_2$, $c(\epsilon \to kl)$ is the cost of inserting the edge $kl$ in $G_1$.

These cost functions traditionally depend on the labels of the vertices and
of the edges. Table 1 gives a summary of the notations.


4.1.2 Objective function 

The objective function (3) is the overall cost induced by applying an edit path
$(x, y, u, v, e, f)$ that transforms a graph $G_1$ into a graph $G_2$, using the elementary
costs of Table 1. In order to get the graph edit distance between $G_1$ and
$G_2$, this cost must be minimized.
Type     Edit operation   $G_1$   $G_2$   Cost                  Variable
Vertex   Substitution     $i$     $k$     $c(i \to k)$          $x_{i,k}$
Vertex   Deletion         $i$     ×       $c(i \to \epsilon)$   $u_i$
Vertex   Insertion        ×       $k$     $c(\epsilon \to k)$   $v_k$
Edge     Substitution     $ij$    $kl$    $c(ij \to kl)$        $y_{ij,kl}$
Edge     Deletion         $ij$    ×       $c(ij \to \epsilon)$  $e_{ij}$
Edge     Insertion        ×       $kl$    $c(\epsilon \to kl)$  $f_{kl}$

Table 1: Summary of the notations for the GED framework


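\[
\begin{aligned}
\min_{x, y, u, v, e, f} \quad & \sum_{i \in V_1} \sum_{k \in V_2} c(i \to k) \cdot x_{i,k}
 + \sum_{ij \in E_1} \sum_{kl \in E_2} c(ij \to kl) \cdot y_{ij,kl} \\
& + \sum_{i \in V_1} c(i \to \epsilon) \cdot u_i + \sum_{k \in V_2} c(\epsilon \to k) \cdot v_k \\
& + \sum_{ij \in E_1} c(ij \to \epsilon) \cdot e_{ij} + \sum_{kl \in E_2} c(\epsilon \to kl) \cdot f_{kl}
\end{aligned}
\tag{3}
\]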


4.1.3 Constraints 

The constraints presented in this part are designed to guarantee that the admissible
solutions of the BLP are edit paths that transform $G_1$ into a graph which
is isomorphic to $G_2$. An edit path is considered as admissible if and only if the
following conditions are respected:

1. it provides a one-to-one mapping between a subset of the vertices of $G_1$
and a subset of the vertices of $G_2$. The remaining vertices are either
deleted or inserted,

2. it provides a one-to-one mapping between a subset of the edges of $G_1$ and
a subset of the edges of $G_2$. The remaining edges are either deleted or
inserted,

3. the vertex matchings and the edge matchings are consistent, i.e. the
graph topology is respected.

The following paragraphs describe the linear constraints used to integrate
these conditions into the BLP.


(i) Vertices matching constraints The constraint (4) ensures that each
vertex of $G_1$ is either matched to exactly one vertex of $G_2$ or deleted from $G_1$,
while the constraint (5) ensures that each vertex of $G_2$ is either matched to
exactly one vertex of $G_1$ or inserted in $G_1$:

\[
u_i + \sum_{k \in V_2} x_{i,k} = 1 \quad \forall i \in V_1 \tag{4}
\]

\[
v_k + \sum_{i \in V_1} x_{i,k} = 1 \quad \forall k \in V_2 \tag{5}
\]


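(ii) Edges matching constraints Similarly to the vertex matching constraints,
the constraints (6) and (7) guarantee a valid mapping between the
edges:

\[
e_{ij} + \sum_{kl \in E_2} y_{ij,kl} = 1 \quad \forall ij \in E_1 \tag{6}
\]

\[
f_{kl} + \sum_{ij \in E_1} y_{ij,kl} = 1 \quad \forall kl \in E_2 \tag{7}
\]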

(iii) Topological constraints The respect of the graph topology in the
matching of the vertices and of the edges is described in the following proposition:

Proposition 1. An edge $ij \in E_1$ can be matched to an edge $kl \in E_2$ if and
only if the head vertices $i \in V_1$ and $k \in V_2$, on the one hand, and if the tail
vertices $j \in V_1$ and $l \in V_2$, on the other hand, are respectively matched.

This quadratic constraint can be expressed linearly with the following constraints
(8) and (9):

• $ij$ and $kl$ can be matched if and only if their head vertices are matched:

\[
y_{ij,kl} \leq x_{i,k} \quad \forall (ij, kl) \in E_1 \times E_2 \tag{8}
\]

• $ij$ and $kl$ can be matched if and only if their tail vertices are matched:

\[
y_{ij,kl} \leq x_{j,l} \quad \forall (ij, kl) \in E_1 \times E_2 \tag{9}
\]
4.1.4 Straightforward formulation 

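Putting equations (3) to (9) altogether leads to a first straightforward version of
the BLP formulation:

(F1)

\[
\begin{aligned}
\min_{x, y, u, v, e, f} \quad & \sum_{i \in V_1} \sum_{k \in V_2} c(i \to k) \cdot x_{i,k}
 + \sum_{ij \in E_1} \sum_{kl \in E_2} c(ij \to kl) \cdot y_{ij,kl} \\
& + \sum_{i \in V_1} c(i \to \epsilon) \cdot u_i + \sum_{k \in V_2} c(\epsilon \to k) \cdot v_k \\
& + \sum_{ij \in E_1} c(ij \to \epsilon) \cdot e_{ij} + \sum_{kl \in E_2} c(\epsilon \to kl) \cdot f_{kl} && && \text{(10a)} \\
\text{subject to} \quad & u_i + \sum_{k \in V_2} x_{i,k} = 1 && \forall i \in V_1 && \text{(10b)} \\
& v_k + \sum_{i \in V_1} x_{i,k} = 1 && \forall k \in V_2 && \text{(10c)} \\
& e_{ij} + \sum_{kl \in E_2} y_{ij,kl} = 1 && \forall ij \in E_1 && \text{(10d)} \\
& f_{kl} + \sum_{ij \in E_1} y_{ij,kl} = 1 && \forall kl \in E_2 && \text{(10e)} \\
& y_{ij,kl} \leq x_{i,k} && \forall (ij, kl) \in E_1 \times E_2 && \text{(10f)} \\
& y_{ij,kl} \leq x_{j,l} && \forall (ij, kl) \in E_1 \times E_2 && \text{(10g)} \\
\text{with} \quad & x_{i,k} \in \{0, 1\} && \forall (i, k) \in V_1 \times V_2 && \text{(10h)} \\
& y_{ij,kl} \in \{0, 1\} && \forall (ij, kl) \in E_1 \times E_2 && \text{(10i)} \\
& u_i \in \{0, 1\} && \forall i \in V_1 && \text{(10j)} \\
& v_k \in \{0, 1\} && \forall k \in V_2 && \text{(10k)} \\
& e_{ij} \in \{0, 1\} && \forall ij \in E_1 && \text{(10l)} \\
& f_{kl} \in \{0, 1\} && \forall kl \in E_2 && \text{(10m)}
\end{aligned}
\]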

The domain constraints, from (10h) to (10m), are used to ensure that the
solution is binary. Thus, the formulation (F1) has:

• $|V_1| + |V_2| + |E_1| + |E_2| + |V_1| \cdot |V_2| + |E_1| \cdot |E_2|$ variables,

• $|V_1| + |V_2| + |E_1| + |E_2| + 2 \cdot |E_1| \cdot |E_2|$ constraints (without the domain
constraints).
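To make the feasible set of (F1) tangible, the sketch below computes the exact GED of two tiny simple directed graphs by exhaustive enumeration. It is not a BLP solver: it enumerates every vertex mapping allowed by the vertex matching constraints and, for each of them, picks the cheapest edge variables allowed by the edge matching and topological constraints. All function names and the dictionary of cost callables are hypothetical, and the running time is exponential.

```python
def vertex_mappings(V1, V2):
    """Yield every map phi: V1 -> V2 ∪ {None} that uses each k in V2 at most once.

    phi[i] = k encodes x_{i,k} = 1 and phi[i] = None encodes u_i = 1, so each
    phi satisfies the vertex matching constraints (4) and (5).
    """
    def extend(idx, phi, used):
        if idx == len(V1):
            yield dict(phi)
            return
        i = V1[idx]
        phi[i] = None                      # delete vertex i
        yield from extend(idx + 1, phi, used)
        for k in V2:
            if k not in used:              # k may be matched at most once
                phi[i] = k                 # substitute i with k
                used.add(k)
                yield from extend(idx + 1, phi, used)
                used.remove(k)
        del phi[i]
    yield from extend(0, {}, set())


def path_cost(phi, V1, E1, V2, E2, c):
    """Objective (3) for the cheapest edge variables compatible with phi."""
    total = 0.0
    for i in V1:
        total += c["v_del"](i) if phi[i] is None else c["v_sub"](i, phi[i])
    matched = {k for k in phi.values() if k is not None}
    total += sum(c["v_ins"](k) for k in V2 if k not in matched)
    substituted = set()
    for (i, j) in E1:
        kl = (phi[i], phi[j])
        # Constraints (8) and (9) only allow ij to be matched with this kl.
        if kl in E2 and c["e_sub"]((i, j), kl) <= c["e_del"]((i, j)) + c["e_ins"](kl):
            total += c["e_sub"]((i, j), kl)
            substituted.add(kl)
        else:
            total += c["e_del"]((i, j))
    total += sum(c["e_ins"](kl) for kl in E2 if kl not in substituted)
    return total


def ged_brute_force(V1, E1, V2, E2, c):
    return min(path_cost(phi, V1, E1, V2, E2, c)
               for phi in vertex_mappings(V1, V2))


# Unlabeled graphs: substitutions are free, insertions and deletions cost 1.
unit = {"v_sub": lambda i, k: 0.0, "v_del": lambda i: 1.0, "v_ins": lambda k: 1.0,
        "e_sub": lambda ij, kl: 0.0, "e_del": lambda ij: 1.0, "e_ins": lambda kl: 1.0}
distance = ged_brute_force([1, 2], {(1, 2)},
                           ["a", "b", "c"], {("a", "b"), ("b", "c")}, unit)
```

In the example, a one-edge path is compared with a two-edge path: one vertex and one edge have to be inserted, so the distance is 2 under unit costs.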


4.2 Reducing the size of the formulation 

In this subsection, we present a formulation that has been derived from the
formulation (F1). We show that this formulation reduces the number of variables
and the number of constraints. It will be shown in Section 5 that this new
formulation is more efficient.
4.2.1 Reducing the number of variables

In the formulation (F1), the variables u, v, e and f help the reader to understand
how the objective function and the constraints were obtained, but they are
unnecessary to solve the GED problem.

We transform the vertex matching constraints (4) and (5) into inequality
constraints, without changing their role in the program. As a side effect, it
removes the u and v variables from the constraints:


\[
\sum_{k \in V_2} x_{i,k} \leq 1 \quad \forall i \in V_1 \tag{11}
\]

\[
\sum_{i \in V_1} x_{i,k} \leq 1 \quad \forall k \in V_2 \tag{12}
\]

We do the same for edge matching constraints (6) and (7):

\[
\sum_{kl \in E_2} y_{ij,kl} \leq 1 \quad \forall ij \in E_1 \tag{13}
\]

\[
\sum_{ij \in E_1} y_{ij,kl} \leq 1 \quad \forall kl \in E_2 \tag{14}
\]

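We then replace u, v, e and f variables in the objective function (3) by their
expressions, which can be easily deduced from equations (4), (5), (6) and (7):

\[
\begin{aligned}
& \sum_{i \in V_1} \sum_{k \in V_2} c(i \to k) \cdot x_{i,k}
 + \sum_{ij \in E_1} \sum_{kl \in E_2} c(ij \to kl) \cdot y_{ij,kl} \\
& \quad + \sum_{i \in V_1} c(i \to \epsilon) \cdot u_i + \sum_{k \in V_2} c(\epsilon \to k) \cdot v_k \\
& \quad + \sum_{ij \in E_1} c(ij \to \epsilon) \cdot e_{ij} + \sum_{kl \in E_2} c(\epsilon \to kl) \cdot f_{kl} \\
& = \sum_{i \in V_1} \sum_{k \in V_2} \left( c(i \to k) - c(i \to \epsilon) - c(\epsilon \to k) \right) \cdot x_{i,k} \\
& \quad + \sum_{ij \in E_1} \sum_{kl \in E_2} \left( c(ij \to kl) - c(ij \to \epsilon) - c(\epsilon \to kl) \right) \cdot y_{ij,kl} + C
\end{aligned}
\tag{15}
\]

with

\[
C = \sum_{i \in V_1} c(i \to \epsilon) + \sum_{k \in V_2} c(\epsilon \to k)
 + \sum_{ij \in E_1} c(ij \to \epsilon) + \sum_{kl \in E_2} c(\epsilon \to kl)
\]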

As all insertion and deletion variables can be a posteriori deduced from the
substitution variables, the constraints (11) to (14) describe exactly the same set
of edit paths as the constraints (4) to (7). Equation (15) shows that the GED
can be obtained without explicitly computing the variables u, v, e and f.
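The substitution of u, v, e and f can be checked numerically on a toy instance. The sketch below evaluates the objective with all six families of variables and the reduced expression that only uses x, y and the constant C, for one admissible edit path. The graphs, the cost values and the variable names are hypothetical and serve only to exercise the identity.

```python
import math

V1, V2 = [0, 1], ["a", "b"]
E1, E2 = [(0, 1)], [("a", "b")]

# Hypothetical elementary costs.
c_vs = {(0, "a"): 1.0, (0, "b"): 2.0, (1, "a"): 3.0, (1, "b"): 0.5}  # c(i -> k)
c_vd = {0: 4.0, 1: 5.0}                                              # c(i -> eps)
c_vi = {"a": 6.0, "b": 7.0}                                          # c(eps -> k)
c_es = {((0, 1), ("a", "b")): 1.5}                                   # c(ij -> kl)
c_ed = {(0, 1): 2.5}                                                 # c(ij -> eps)
c_ei = {("a", "b"): 3.5}                                             # c(eps -> kl)

# Substitution variables: vertex 0 is substituted with "a", nothing else is matched.
x = {(i, k): 0 for i in V1 for k in V2}
x[(0, "a")] = 1
y = {(ij, kl): 0 for ij in E1 for kl in E2}

# Deletion and insertion variables deduced from the equality constraints.
u = {i: 1 - sum(x[(i, k)] for k in V2) for i in V1}
v = {k: 1 - sum(x[(i, k)] for i in V1) for k in V2}
e = {ij: 1 - sum(y[(ij, kl)] for kl in E2) for ij in E1}
f = {kl: 1 - sum(y[(ij, kl)] for ij in E1) for kl in E2}

# Objective with the six families of variables.
full_objective = (
    sum(c_vs[p] * x[p] for p in x) + sum(c_es[p] * y[p] for p in y)
    + sum(c_vd[i] * u[i] for i in V1) + sum(c_vi[k] * v[k] for k in V2)
    + sum(c_ed[ij] * e[ij] for ij in E1) + sum(c_ei[kl] * f[kl] for kl in E2)
)

# Reduced objective: only x and y appear, plus the constant C.
C = sum(c_vd.values()) + sum(c_vi.values()) + sum(c_ed.values()) + sum(c_ei.values())
reduced_objective = (
    sum((c_vs[(i, k)] - c_vd[i] - c_vi[k]) * x[(i, k)] for (i, k) in x)
    + sum((c_es[(ij, kl)] - c_ed[ij] - c_ei[kl]) * y[(ij, kl)] for (ij, kl) in y)
    + C
)
```

Both expressions evaluate to the same cost for this edit path; C corresponds to deleting all of the first graph and inserting all of the second, and each substitution then corrects that baseline.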
4.2.2 Reducing the number of constraints

In the formulation (F1), the number of topological constraints is
$|E_1| \cdot |E_2|$ for each of (8) and (9). Therefore, on average, the number of constraints grows quadratically
with the mean density of the graphs. We show that it is possible to formulate
the GED problem with potentially fewer constraints, leaving the set of solutions
unchanged. To this end, we propose to mathematically express Proposition 1
in another way. We replace the constraints (8) and (9) by the following ones:

• Given an edge $ij \in E_1$ and a vertex $k \in V_2$, there is at most one edge
whose initial vertex is $k$ that can be matched with $ij$:

\[
\sum_{kl \in E_2} y_{ij,kl} \leq x_{i,k} \quad \forall k \in V_2, \forall ij \in E_1 \tag{16}
\]

• Given an edge $ij \in E_1$ and a vertex $l \in V_2$, there is at most one edge
whose terminal vertex is $l$ that can be matched with $ij$:

\[
\sum_{kl \in E_2} y_{ij,kl} \leq x_{j,l} \quad \forall l \in V_2, \forall ij \in E_1 \tag{17}
\]

Proposition 2. Let $\Gamma_1$ be the set of edit paths (between $G_1$ and $G_2$) implied by
the set of admissible solutions of (F1), and let $\Gamma_2$ be the set of edit paths obtained
similarly by replacing in (F1) the constraints (8) and (9) by the constraints (16)
and (17). Then $\Gamma_1 = \Gamma_2$.

Proof.

$\Gamma_2 \subseteq \Gamma_1$: Let $ij \in E_1$ and $kl \in E_2$, and let us suppose that (16) is satisfied.

\[
\begin{aligned}
& x_{i,k} \geq \sum_{kl' \in E_2} y_{ij,kl'} \\
\Rightarrow \; & x_{i,k} \geq y_{ij,kl} + \sum_{kl' \in E_2, \, kl' \neq kl} y_{ij,kl'} \\
\Rightarrow \; & x_{i,k} \geq y_{ij,kl}
\end{aligned}
\]

Thus, the constraint (8) is satisfied for all $ij \in E_1$ and for all $kl \in E_2$.
Similarly, we deduce that (9) is satisfied using the constraint (17).

$\Gamma_1 \subseteq \Gamma_2$: Let $ij \in E_1$ and $k \in V_2$.

If $\{l \in V_2 : kl \in E_2\} = \emptyset$, then $\sum_{kl \in E_2} y_{ij,kl} = 0$ and (16) is satisfied.
Otherwise, using the constraint (8), we have:

\[
\forall kl \in E_2, \; x_{i,k} \geq y_{ij,kl} \;\Rightarrow\; x_{i,k} \geq \max_{kl \in E_2} (y_{ij,kl})
\]

Constraint (6) ensures that $\mathrm{card}\{l' \in V_2 : y_{ij,kl'} = 1\} \leq 1$, thus:

\[
\max_{kl' \in E_2} (y_{ij,kl'}) = \sum_{kl' \in E_2} y_{ij,kl'}
\quad\Rightarrow\quad
x_{i,k} \geq \sum_{kl' \in E_2} y_{ij,kl'}
\]

and (16) is still satisfied.

Thus, the constraint (16) is satisfied for all $ij \in E_1$ and for all $k \in V_2$.
Similarly, we prove that (17) is satisfied using (9) and (6). □


The number of topological constraints, (16) and (17), is now $|E_1| \cdot |V_2|$.
On average, it grows linearly with the density of the graphs. This leads to
substantially shorter formulations of the GED as the number of graph vertices
and edges grows.

Note that another substitution of constraints (8) and (9) is possible,
namely with the two following constraints:

\[
\sum_{ij \in E_1} y_{ij,kl} \le x_{i,k} \qquad \forall i \in V_1,\ \forall kl \in E_2 \tag{18}
\]

\[
\sum_{ij \in E_1} y_{ij,kl} \le x_{j,l} \qquad \forall j \in V_1,\ \forall kl \in E_2 \tag{19}
\]

This leads to a strictly equivalent formulation in terms of admissible solutions;
however, it changes the number of topological constraints, (18) and (19), which
would be $|E_2| \cdot |V_1|$.

In addition, we prove that the constraints (13) and (14) are not necessary to
the formulation of the GED problem, since they are implied by other constraints
of the BLP.


Proposition 3. Constraint (13) is implied by (11) and (16).


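Proof. Let $ij \in E_1$. Given (16), we have:

\[
\sum_{kl \in E_2} y_{ij,kl} \le x_{i,k} \quad \forall k \in V_2
\quad\Rightarrow\quad
\sum_{k \in V_2} \sum_{kl \in E_2} y_{ij,kl} \le \sum_{k \in V_2} x_{i,k}
\]

where, for a fixed $k$, the inner sum ranges over the edges of $E_2$ whose
initial vertex is $k$. We reduce the left term of this inequality and we use (11):

\[
\sum_{kl \in E_2} y_{ij,kl} \le \sum_{k \in V_2} x_{i,k} \le 1
\]

Thus, (13) is implied by (11) and (16). Similarly, we prove that (14) is
implied by (12) and (17). □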
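Propositions 2 and 3 can be checked numerically on a toy instance. The sketch below is only an illustration, not part of the proofs: the two small directed graphs are made up, and the constraint forms are assumptions read from the surrounding text, namely (11) and (12) as the "at most one match per vertex" constraints, (13) and (14) as their edge counterparts, and (8) and (9) as $y_{ij,kl} \le x_{i,k}$ and $y_{ij,kl} \le x_{j,l}$. It enumerates every binary assignment and compares the two feasible sets.

```python
from itertools import product

# Tiny directed graphs (hypothetical example data, chosen for illustration).
V1, E1 = [0, 1, 2], [(0, 1), (2, 1)]
V2, E2 = [0, 1, 2], [(0, 1), (0, 2)]

XKEYS = [(i, k) for i in V1 for k in V2]
YKEYS = [(e, f) for e in E1 for f in E2]


def vertex_ok(x):
    """Constraints (11) and (12): each vertex is matched at most once."""
    return (all(sum(x[i, k] for k in V2) <= 1 for i in V1)
            and all(sum(x[i, k] for i in V1) <= 1 for k in V2))


def edge_ok(y):
    """Constraints (13) and (14): each edge is matched at most once."""
    return (all(sum(y[e, f] for f in E2) <= 1 for e in E1)
            and all(sum(y[e, f] for e in E1) <= 1 for f in E2))


def pairwise_ok(x, y):
    """Constraints (8) and (9): one inequality per pair of edges."""
    return all(y[(i, j), (k, l)] <= min(x[i, k], x[j, l])
               for (i, j) in E1 for (k, l) in E2)


def aggregated_ok(x, y):
    """Constraints (16) and (17): one inequality per (edge of G1, vertex of G2)."""
    for (i, j) in E1:
        for v in V2:
            leaving = sum(y[(i, j), (k, l)] for (k, l) in E2 if k == v)
            entering = sum(y[(i, j), (k, l)] for (k, l) in E2 if l == v)
            if leaving > x[i, v] or entering > x[j, v]:
                return False
    return True


full, reduced = set(), set()
for bits in product((0, 1), repeat=len(XKEYS) + len(YKEYS)):
    x = dict(zip(XKEYS, bits))
    y = dict(zip(YKEYS, bits[len(XKEYS):]))
    if vertex_ok(x) and edge_ok(y) and pairwise_ok(x, y):
        full.add(bits)       # (11)-(14) with (8) and (9)
    if vertex_ok(x) and aggregated_ok(x, y):
        reduced.add(bits)    # (11), (12) with (16) and (17) only

same_feasible_set = (full == reduced)
print(same_feasible_set, len(full))
```

Such an enumeration is only practical for a handful of vertices and edges; it is meant as a sanity check of the equivalence on binary solutions, not as evidence for the general statement.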


4.2.3 Simplified formulation 

The results obtained in Sections 4.2.1 and 4.2.2 show that the GED problem can
also be solved by using (15) as the objective function and (11), (12), (16) and
(17) as the constraints of the BLP. We finally come up with a simplified
formulation of the GED problem:


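(F2)

\begin{align}
\min_{x,y}\ \Bigl( & \sum_{i \in V_1} \sum_{k \in V_2} \bigl( c(i \to k) - c(i \to \epsilon) - c(\epsilon \to k) \bigr) \cdot x_{i,k} \notag \\
& + \sum_{ij \in E_1} \sum_{kl \in E_2} \bigl( c(ij \to kl) - c(ij \to \epsilon) - c(\epsilon \to kl) \bigr) \cdot y_{ij,kl} + C \Bigr) \tag{20a} \\
\text{subject to}\quad & \sum_{k \in V_2} x_{i,k} \le 1 && \forall i \in V_1 \tag{20b} \\
& \sum_{i \in V_1} x_{i,k} \le 1 && \forall k \in V_2 \tag{20c} \\
& \sum_{kl \in E_2} y_{ij,kl} \le x_{i,k} && \forall k \in V_2,\ \forall ij \in E_1 \tag{20d} \\
& \sum_{kl \in E_2} y_{ij,kl} \le x_{j,l} && \forall l \in V_2,\ \forall ij \in E_1 \tag{20e} \\
\text{with}\quad & x_{i,k} \in \{0,1\} && \forall (i,k) \in V_1 \times V_2 \tag{20f} \\
& y_{ij,kl} \in \{0,1\} && \forall (ij,kl) \in E_1 \times E_2 \tag{20g}
\end{align}

The formulation (F2) has:

• $|V_1| \cdot |V_2| + |E_1| \cdot |E_2|$ variables,

• $|V_1| + |V_2| + 2 |V_2| \cdot |E_1|$ constraints (without the domain constraints).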

Thus, it uses fewer variables than (F1), and depending on the density of the
graphs, it potentially uses fewer constraints to solve the same problem.
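The size of (F2) follows directly from the two counts given above. The short sketch below evaluates them for a given pair of graphs; the function name and the example sizes are chosen here for illustration only.

```python
def f2_size(n1, n2, m1, m2):
    """Return (variables, constraints) of (F2).

    n1, n2: numbers of vertices |V1| and |V2|.
    m1, m2: numbers of edges |E1| and |E2|.
    Domain constraints (20f) and (20g) are not counted.
    """
    variables = n1 * n2 + m1 * m2
    # (20b) and (20c) contribute n1 + n2, (20d) and (20e) contribute n2 * m1 each.
    constraints = n1 + n2 + 2 * n2 * m1
    return variables, constraints


# Two hypothetical graphs with 10 vertices and 20 edges each.
variables, constraints = f2_size(n1=10, n2=10, m1=20, m2=20)
print(variables, constraints)  # 500 420
```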


4.3 Solving the programs 

Solving an ILP is NP-hard WL thus exploring the entire solution tree is not 
an option since it would take an exponential time. However, dedicated solvers 
have been developed to reduce the number of explored solutions and the solving 
time, by using a branch-and-cut algorithm along with some heuristics (3SJ- 


Once equations (2a) to (2c) are correctly formulated, the second step consists
in implementing this model using a mathematical solver. Given an instance of
the problem, the solver explores the tree of solutions with the branch-and-bound
algorithm, and finds the best feasible solution with respect to the objective
function.
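To make the model concrete without depending on a particular solver, the following sketch evaluates (F2) by exhaustive enumeration of the binary variables. This is a stand-in for the branch-and-bound or branch-and-cut search of a real solver and is only usable on toy instances. The graph encoding, the function names and the unit cost functions are assumptions made for this example; the constant term is taken to be the cost of deleting all of $G_1$ and inserting all of $G_2$, which is what the coefficients of (20a) imply.

```python
from itertools import product


def is_feasible(x, y, V1, E1, V2, E2):
    """Check constraints (20b) to (20e) for a binary assignment (x, y)."""
    if any(sum(x[i, k] for k in V2) > 1 for i in V1):        # (20b)
        return False
    if any(sum(x[i, k] for i in V1) > 1 for k in V2):        # (20c)
        return False
    for (i, j) in E1:
        for v in V2:
            leaving = sum(y[(i, j), (k, l)] for (k, l) in E2 if k == v)
            entering = sum(y[(i, j), (k, l)] for (k, l) in E2 if l == v)
            if leaving > x[i, v] or entering > x[j, v]:       # (20d), (20e)
                return False
    return True


def solve_f2_by_enumeration(V1, E1, V2, E2, costs):
    """Minimise (20a) by brute force.

    V1, V2: dict vertex -> label; E1, E2: dict (tail, head) -> label.
    costs: dict of callables 'vsub', 'vdel', 'vins', 'esub', 'edel', 'eins'.
    """
    xkeys = [(i, k) for i in V1 for k in V2]
    ykeys = [(e, f) for e in E1 for f in E2]
    # Constant term: delete everything in G1 and insert everything in G2.
    const = (sum(costs["vdel"](V1[i]) for i in V1)
             + sum(costs["vins"](V2[k]) for k in V2)
             + sum(costs["edel"](E1[e]) for e in E1)
             + sum(costs["eins"](E2[f]) for f in E2))
    best = None
    for bits in product((0, 1), repeat=len(xkeys) + len(ykeys)):
        x = dict(zip(xkeys, bits))
        y = dict(zip(ykeys, bits[len(xkeys):]))
        if not is_feasible(x, y, V1, E1, V2, E2):
            continue
        value = const
        for (i, k) in xkeys:
            value += (costs["vsub"](V1[i], V2[k])
                      - costs["vdel"](V1[i]) - costs["vins"](V2[k])) * x[i, k]
        for (e, f) in ykeys:
            value += (costs["esub"](E1[e], E2[f])
                      - costs["edel"](E1[e]) - costs["eins"](E2[f])) * y[e, f]
        if best is None or value < best:
            best = value
    return best


# Hypothetical unit costs: substitution is free for equal labels, 1 otherwise.
unit = {
    "vsub": lambda a, b: 0 if a == b else 1,
    "vdel": lambda a: 1,
    "vins": lambda b: 1,
    "esub": lambda a, b: 0 if a == b else 1,
    "edel": lambda a: 1,
    "eins": lambda b: 1,
}
V1 = {0: "A", 1: "B"}
E1 = {(0, 1): "a"}
V2 = {0: "A", 1: "B", 2: "C"}
E2 = {(0, 1): "a", (1, 2): "a"}
ged = solve_f2_by_enumeration(V1, E1, V2, E2, unit)
print(ged)  # 2: one vertex insertion and one edge insertion
```

In practice the same variables, coefficients and constraints would be handed to a mathematical programming solver rather than enumerated.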


4.4 Lower bounding the GED with continuous relaxation 

The common resolution method of an ILP consists in using a branch-and-bound
algorithm. The continuous relaxation of an ILP, i.e. a linear program (LP)
where the constraints remain unmodified but where the variables are now
continuous, provides a lower bound of the minimization problem and can be
solved in polynomial time $O(n^{3.5})$ with the interior point method [39].
This lower bound helps the ILP solving since it allows pruning the exploration
of the solution tree.

However, the continuous relaxation can also be used to approximate the
optimal objective value in polynomial time. We call F1LP (resp. F2LP) the
continuous relaxation of F1 (resp. F2). To obtain it, we only substitute the
discrete space {0,1} by the continuous space [0,1] in domain constraints (12i)
to (12n).
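The reason why the relaxation yields a lower bound can be written in one line. In the sketch below, $P$ denotes the polyhedron defined by the linear constraints, $f$ the linear objective and $n$ the number of variables; these three symbols are introduced here for illustration and do not appear in the formulations above.

```latex
\[
P \cap \{0,1\}^n \subseteq P \cap [0,1]^n
\quad\Longrightarrow\quad
\min_{z \in P \cap [0,1]^n} f(z)
\;\le\;
\min_{z \in P \cap \{0,1\}^n} f(z)
= \mathrm{GED}(G_1, G_2)
\]
```

Every binary solution remains feasible for the relaxed program, so minimizing over the larger set can only decrease the optimal value.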

5 Experiments 

As stated in the introduction, one of the contributions of this article is to
provide the reader with a robust experimental study through a comparison of
eight methods. In this section, we first describe the methods that have been
studied. Then, the datasets and the protocol used to compare the reference
methods and our proposals are described. Finally, the results are presented
and discussed.

5.1 Studied methods 

We compare the four approaches proposed previously with four other graph edit
distance algorithms from the literature. From the related work, we chose one
exact method and three approximate methods. On the exact method side, the A*
algorithm applied to the GED problem [28] is a foundational work. In our tests,
the heuristic is computed with the approximation based on bipartite graph
matching. It is the best-known exact method and it is often used to evaluate
the accuracy of approximate methods. On the approximate method side,
we can distinguish three families of methods: tree-based methods,
assignment-based methods and set-based methods. For the tree-based methods, a
truncated version of A* called beam search was chosen. This method is known to
be one of the most accurate heuristics from the literature [22]. Among the
assignment-based methods, we selected the bipartite graph matching described in
[21]. In [21], the authors demonstrated that this upper bound is a good
compromise between speed and accuracy. Finally, we picked a very recent
set-based method. In 2014, A. Fischer et al. [32] proposed an approach based on
the Hausdorff matching. This method is a lower bound of the GED problem.
Together, these methods cover a large range of GED solvers. Table 2 presents,
for each method, its acronym, its type (exact or not) and a short description.
We could not assess our methods against the whole state of the art. Among the
missing methods, we did not experimentally compare our proposals against the
binary linear programs proposed by Justice and Hero [35]. Despite our best
efforts, we could find neither the source code or binary files of the method
nor the datasets used in their experiments.




Acronym            Type         Description of the method
-----------------  -----------  -----------------------------------------------
A* [28]            Exact        A* algorithm using a bipartite heuristic.
F1 (this paper)    Exact        Our first binary linear programming formulation.
F2 (this paper)    Exact        Our second BLP formulation, derived from (F1).
BP [21]            Upper bound  Bipartite graph matching using Munkres algorithm.
BS-q [22]          Upper bound  A* algorithm with beam search approach and using
                                a bipartite heuristic.
H [32]             Lower bound  Modified Hausdorff distance applied to graphs.
F1LP (this paper)  Lower bound  Linear programming approach, continuous
                                relaxation of (F1).
F2LP (this paper)  Lower bound  Linear programming approach, continuous
                                relaxation of (F2).

Table 2: Notations corresponding to each optimal or suboptimal method

5.2 Datasets 

Graph edit distance algorithms are applied to three different real-world graph
datasets (GREC, Protein, Mutagenicity) and to one synthetic dataset (ILPISO).
Real-world datasets are described in [41] while the synthetic dataset is
described in U2J. All datasets are publicly available on the IAPR Technical
Committee #15 website². From these datasets, we have built subsets where all
graphs have the same number of vertices in order to evaluate the behaviour of
the algorithms when complexity grows. The underlying assumption is that the
problem becomes more complex as the graphs hold more vertices. Each dataset is
described in three steps. We first present the application field and the graph
construction. Secondly, the cost function used for the considered dataset is
presented. Finally, the interest of the dataset is discussed. A synthesis of
these data is given in Table 4. For each dataset, the corresponding subset and
the code of the cost function are available at
https://sites.google.com/site/blpged/.

5.2.1 GREC dataset (GREC) 

The GREC dataset consists of graphs representing symbols from architectural
and electronic drawings. The images occur at five different distortion levels.
The result is thinned to obtain lines of one pixel width. Finally, graphs are
extracted from the resulting denoised images by tracing the lines from end to
end and detecting intersections as well as corners. Ending points, corners,
intersections and circles are represented by vertices and labeled with a
two-dimensional attribute giving their position. The vertices are connected by
undirected edges which are labeled as line or arc. An additional attribute
specifies the angle with respect to the horizontal direction or the diameter in
case of arcs. From the original GREC dataset [43], 22 classes are considered.
From the IAM GREC dataset, subsets were built and their corresponding
characteristics are provided in Table 3.

2 https://iapr-tc15.greyc.fr/links.html#Benchmarking%20and%20data%20sets


Cost function  In addition to $(x, y)$ coordinates, the graph vertices are
labeled with a type (ending point, corner, intersection, circle). The same goes
for the edges, where two types (line, arc) are employed. The Euclidean cost
model is adapted accordingly. That is, for vertex substitutions the type of the
involved vertices is compared first. For identically typed vertices, the
Euclidean distance is used as vertex substitution cost. In case of
non-identical types on the vertices, the substitution cost is set to
$2 \cdot \tau_{vertex}$, which reflects the intuition that vertices with
different type labels cannot be substituted but have to be deleted and
inserted, respectively. For edge substitutions, we measure the dissimilarity
of two types with a Dirac function returning 0 if the two types are equal, and
$2 \cdot \tau_{edge}$ otherwise. Meta-parameters $\tau_{vertex}$ and
$\tau_{edge}$ are explained in Section 5.3.3. Elementary operation costs are
set up from [23] and they are reported in Table 5.
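The GREC substitution costs described above can be sketched as two small functions. The vertex encoding as a `(type, x, y)` tuple and the function names are assumptions made for this example; the default values 90 and 15 are the GREC meta-parameters listed in Table 5, and the weighting by $\alpha$ is deliberately left out.

```python
import math

TAU_VERTEX = 90.0  # GREC vertex deletion/insertion cost (Table 5)
TAU_EDGE = 15.0    # GREC edge deletion/insertion cost (Table 5)


def grec_vertex_substitution(v1, v2, tau_vertex=TAU_VERTEX):
    """Substitution cost between two vertices encoded as (type, x, y)."""
    type1, x1, y1 = v1
    type2, x2, y2 = v2
    if type1 != type2:
        # Different types: as expensive as a deletion followed by an insertion.
        return 2.0 * tau_vertex
    return math.hypot(x1 - x2, y1 - y2)  # Euclidean distance between positions


def grec_edge_substitution(type1, type2, tau_edge=TAU_EDGE):
    """Dirac-like cost on edge types ('line' or 'arc')."""
    return 0.0 if type1 == type2 else 2.0 * tau_edge


print(grec_vertex_substitution(("corner", 0.0, 0.0), ("corner", 3.0, 4.0)))  # 5.0
print(grec_vertex_substitution(("corner", 0.0, 0.0), ("circle", 3.0, 4.0)))  # 180.0
print(grec_edge_substitution("line", "arc"))                                  # 30.0
```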


Dataset interest GREC dataset is composed of undirected graphs of rather 
small size (i.e. up to 20 vertices in our experiments). In addition, continuous 
attributes on vertices and edges play an important role in the matching pro¬ 
cedure. Such graphs are representative of pattern recognition problems where 
graphs are involved in a classification stage. 


5.2.2 Mutagenicity dataset (MUTA) 

Mutagenicity is one of the numerous adverse properties of a compound that
hampers its potential to become a marketable drug [25]. This dataset consists of
two classes (mutagen, nonmutagen), which represent molecules. The molecules
are converted into graphs in a straightforward manner by representing atoms
as vertices and the covalent bonds as edges. Vertices are labeled with their
chemical symbol and edges by the valence of the linkage. From this dataset,
subsets were generated so that all graphs of a subset have the same number of
vertices, from 10 to 70 vertices in steps of 10. Every subset holds exactly
10 graphs.


Cost function  Edge substitutions are free of cost. For vertex substitutions,
we measure the dissimilarity of two chemical symbols with a Dirac function
returning 0 if the two symbols are equal, and $2 \cdot \tau_{vertex}$ otherwise.

Dataset interest  This dataset is representative of exact matching problems
in the sense that a significant part of the topology, together with the
corresponding vertex and edge labels in $G_1$ and $G_2$, has to be identical.
In addition, this set of graphs gathers large instances with up to 70 vertices.
Elementary operation costs are set up from [40] and they are reported in
Table 5.

5.2.3 Protein dataset (PROT) 

The protein dataset contains graphs representing proteins originally used in 
. The graphs are constructed from the Protein Data Bank and labeled with 
their corresponding enzyme class labels from the BRENDA enzyme dataset m- 
The protein graphs are split into six classes (EC 1, EC 2, EC 3, EC 4, EC 5, 
EC 6), which represent proteins out of the six enzyme commission top level 
hierarchy (EC classes). The proteins are converted into graphs by representing 
the secondary structure elements of a protein with vertices and edges of an 
attributed graph. Vertices are labeled with their type (helix, sheet, or loop) 
and their amino acid sequence (e.g. TFKEVVRLT). Every vertex is connected 
with an edge to its three nearest neighbors in space. Edges are labeled with their 
type and the distance they represent in angstroms. A summary of all subsets
and their corresponding characteristics is provided in Table 3.



             PROT              ILPISO            GREC
#vertices    20    30    40    10    25    50    5     10    15    20
#graphs      15    13    22    12    12    12    41    74    34    39

Table 3: Subsets decomposition of PROT, ILPISO and GREC datasets


Cost function  For the protein graphs, a cost model based on the amino acid
sequences is used. For vertex substitutions, the type of the involved vertices
is compared first. If two types are identical, the amino acid sequences of the
vertices to be substituted are compared by means of string edit distance.
Similarly to graph edit distance, string edit distance is defined as the cost
of the minimal edit path between a source string and a target string. More
formally, given an alphabet $L$ and two strings $s_1$, $s_2$ defined on $L$
($s_1, s_2 \in L^*$), we allow substitutions, insertions, and deletions of
symbols and define the corresponding cost as follows:

\[
c(u \to v) = c(u \to \epsilon) = c(\epsilon \to v) = 1 \qquad \text{for } u, v \in L,\ u \ne v
\]

Hence, the vertex substitution cost is defined as the cost of the minimum cost
sequence of edit operations that has to be applied to the amino acid sequence
of the source vertex in order to transform it into the amino acid sequence of
the target vertex. If two vertex types (helix, sheet, or loop) are not
identical then the substitution is equivalent to a vertex deletion. For edge
substitutions, we measure the dissimilarity with a Dirac function returning 0
if the two edge types are equal, and $2 \cdot \tau_{edge}$ otherwise.
Elementary operation costs are set up from [44] and they are reported in
Table 5.
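With unit costs for substitutions, insertions and deletions, the string edit distance above is the classical Levenshtein distance. A minimal dynamic-programming sketch is given below; the function name and the second example sequence are chosen here for illustration, and the type check on helix, sheet or loop is not included.

```python
def string_edit_distance(s1, s2):
    """Unit-cost edit distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(s2) + 1))  # distance from the empty prefix of s1
    for i, a in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string costs i deletions
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute a by b (free if equal)
        prev = curr
    return prev[-1]


print(string_edit_distance("TFKEVVRLT", "TFKEVVRLT"))  # 0
print(string_edit_distance("TFKEVVRLT", "TFKVVRT"))    # 2
```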




















                         MUTA             GREC              PROT                  ILPISO
Size                     4337             1100              600                   36
Vertex labels            Chemical symbol  x, y coordinates  Type and aa-sequence  Scalar value
Edge labels              Valence          Line type         Type and distance     Scalar value
Mean number of vertices  30.3             11.5              32.6                  28.3
Mean number of edges     30.8             12.2              62.1                  54.3
Graph type               undirected       undirected        undirected            directed

Table 4: Summary of the graph datasets characteristics


Dataset interest  The stringent constraints imposed by exact vertex matching
are relaxed thanks to the string edit distance, so the matching process can
tolerate differences between labels.

5.2.4 ILPISO dataset (ILPISO) 

Four synthetic datasets are provided. Each of them is composed of several
triplets (pattern graph, target graph and groundtruth). The graphs have been
randomly generated with the Erdős-Rényi model [48], with or without the
constraint of producing connected graphs. For each option, one version of the
dataset has an exact mapping (equal labels between matched vertices/edges)
whereas another version includes noise on label values. The groundtruth
information gives the one-to-one vertex mapping involving the minimal cost
assignment. Each vertex and edge is labelled with a single continuous value in
$[-100, +100]$. Edge and vertex attributes follow a uniform law $U(-100, 100)$.
A summary of all subsets and their corresponding characteristics is provided in
Table 3.

Cost function  For vertex substitutions, we measure the dissimilarity of two
vertices with an absolute difference. For vertex deletion and insertion, a
fixed cost is chosen, equal to $\frac{2}{3} \times 100 \approx 66.6$.
Elementary operation costs are reported in Table 5.

Dataset interest  This dataset stands apart from the others in the sense that
it holds directed graphs. The aim is to illustrate the flexibility of our
proposal, which can handle different types of graphs.

         τ_vertex  τ_edge  α     Vertex substitution function   Edge substitution function
GREC     90        15      0.5   Extended Euclidean distance    Dirac function
PROT     11        1       0.75  Extended string edit distance  Dirac function
MUTA     11        1.1     0.25  Dirac function                 Dirac function
ILPISO   66.6      66.6    0.5   L1 norm                        L1 norm

Table 5: Cost function meta parameters for the four datasets

5.3 Protocol 

In this section, the experimental protocol is detailed. We explain how the
experiments were performed and why we ran these tests.

Our experiments were carried out in a context of graph comparisons. Let
$S$ be a graph dataset consisting of $m$ graphs, $S = \{G_1, G_2, \dots, G_m\}$.
Let $\mathcal{P} = \mathcal{P}_e \cup \mathcal{P}_a$ be the set of all graph edit
distance methods listed in Section 5.1, with $\mathcal{P}_e = \{A^*, F1, F2\}$
the set of exact methods and $\mathcal{P}_a = \{BP, BS\text{-}10, H, F1LP, F2LP\}$
the set of approximate methods (see Table 2 for notations). Given a method
$p \in \mathcal{P}$, we computed the square distance matrix
$M^p \in \mathcal{M}_{m \times m}(\mathbb{R}^+)$ that holds every pairwise
comparison $M^p_{i,j} = d_p(G_i, G_j)$, where the distance $d_p(G_i, G_j)$ is
the value returned by the method $p$ on the graph pair $(G_i, G_j)$ within a
certain time limit, using the cost meta parameters defined in Table 5. For
instance, $M^{F1}$ and $M^{BP}$ denote the distance matrices computed with the
F1 and BP methods respectively.

Due to the large number of matchings considered and the exponential
complexity of the algorithms tested, we allowed a maximum of 300 seconds for
any distance computation. When the time limit is reached, the best solution
found so far is returned by the given method. This time constraint is large
enough to let the methods search deeply into the solution space and to ensure
that many nodes will be explored. The key idea is to reach optimality whenever
it is possible, or at least to get as close as possible to the optimal
solution. This kind of time limit is commonly accepted in the operations
research field

nasi]. 

Based on this context of pairwise graph comparison, a set of metrics is 
defined to measure the accuracy and the speed of our four proposed methods 
and four standard methods. 

In the next subsections, performance evaluation metrics as well as the
experimental settings are detailed.


5.3.1 Accuracy metrics 


To illustrate the error committed by approximate methods with respect to exact
methods, we measure an index called deviation, defined by equation (21):

\[
\mathrm{deviation}(i,j)^p = \frac{\left| M^p_{i,j} - R_{i,j} \right|}{R_{i,j}},
\qquad \forall (i,j) \in [1,m]^2,\ \forall p \in \mathcal{P} \tag{21}
\]

where $\mathcal{P}$ is the set of methods and $R$ is the reference matrix
defined in equation (22):

\[
R_{i,j} = \min_{p \in \mathcal{P} \setminus \{F1LP, F2LP, H\}} \left\{ M^p_{i,j} \right\},
\qquad \forall (i,j) \in [1,m]^2 \tag{22}
\]

For each comparison, the reference matrix holds the optimal graph edit distance
whenever it is possible to compute it. Optimality may not be reached because of
the time restriction. When no optimal solution is available, the lowest graph
edit distance found among all the methods is chosen as the reference value.
The lower bounds (H, F1LP and F2LP) are excluded from equation (22) since
they do not represent feasible solutions and cannot represent real sequences
of edit operations. For a given method, the deviation expresses the error made
by a suboptimal solution as a percentage of the best solution.
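Equations (21) and (22) can be sketched in a few lines. The distance values below are hypothetical and only serve to exercise the functions. The handling of a zero reference value (for instance when a graph is compared with itself) is an assumption of this sketch, since the text does not specify it.

```python
import math

LOWER_BOUNDS = {"F1LP", "F2LP", "H"}  # excluded from the reference, see (22)


def reference_matrix(matrices):
    """Entry-wise minimum over all methods that return feasible edit paths.

    matrices: dict method name -> m x m list of lists of distances.
    """
    feasible = [M for p, M in matrices.items() if p not in LOWER_BOUNDS]
    m = len(feasible[0])
    return [[min(M[i][j] for M in feasible) for j in range(m)] for i in range(m)]


def deviation(M, R):
    """Relative deviation (21) of a distance matrix M from the reference R."""
    m = len(R)
    dev = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if R[i][j] > 0:
                dev[i][j] = abs(M[i][j] - R[i][j]) / R[i][j]
            elif M[i][j] != R[i][j]:
                dev[i][j] = math.inf  # assumption: undefined ratio treated as infinite
    return dev


# Hypothetical distance matrices for a dataset of two graphs.
matrices = {
    "F2":   [[0, 10], [10, 0]],
    "BP":   [[0, 12], [13, 0]],
    "F2LP": [[0, 9], [9, 0]],
}
R = reference_matrix(matrices)
dev_bp = deviation(matrices["BP"], R)
dev_f2lp = deviation(matrices["F2LP"], R)
print(R, dev_bp, dev_f2lp)
```

Note that the lower bound F2LP does not enter the reference but still receives a deviation, measured against the best feasible solution.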

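For each subset, the mean deviation is derived as follows in equation (23):

\[
\overline{\mathrm{deviation}^p} = \frac{1}{m \times m} \sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{deviation}(i,j)^p \tag{23}
\]

To obtain comparable results between datasets, mean deviations are normalized
between [0,1] as follows in equation (24):

\[
\mathrm{deviation\_score}^p = \frac{1}{\#subsets} \sum_{i=1}^{\#subsets}
\frac{\overline{\mathrm{deviation}_i^p}}{\mathrm{max\_dev}_i} \tag{24}
\]

where $\mathrm{max\_dev}_i = \max_{p \in \mathcal{P}} \overline{\mathrm{deviation}_i^p}$
is the largest mean deviation obtained on subset $i$ over all the methods.

The deviation score is a measurement used to compare performance over
subsets.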
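The averaging of (23) and the normalization of (24) can be sketched as follows. The per-subset mean deviations used in the example are hypothetical, and returning 0 when every method has a zero mean deviation on a subset is an assumption of this sketch.

```python
def mean_deviation(dev):
    """Mean of an m x m deviation matrix, as in equation (23)."""
    m = len(dev)
    return sum(sum(row) for row in dev) / (m * m)


def normalized_scores(mean_by_subset):
    """Score of equation (24).

    mean_by_subset: list with one dict per subset, mapping each method name
    to its mean deviation on that subset. Returns dict method -> score in [0, 1].
    """
    methods = list(mean_by_subset[0])
    totals = {p: 0.0 for p in methods}
    for subset in mean_by_subset:
        worst = max(subset.values())  # max_dev_i: worst method on this subset
        for p in methods:
            totals[p] += subset[p] / worst if worst > 0 else 0.0
    return {p: totals[p] / len(mean_by_subset) for p in methods}


# Hypothetical mean deviations on two subsets.
mean_by_subset = [
    {"F2": 0.0, "BS": 0.1, "BP": 0.2},
    {"F2": 0.05, "BS": 0.1, "BP": 0.1},
]
scores = normalized_scores(mean_by_subset)
print(scores)
```

The same normalization applies to mean running times, which gives the speed score of equation (26).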

5.3.2 Speed metrics 

To evaluate the convergence of algorithms, the mean time for each dataset is
derived as follows in equation (25):

\[
\overline{\mathrm{time}^p} = \frac{1}{m \times m} \sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{time}(p, G_i, G_j),
\qquad (i,j) \in [1,m]^2 \tag{25}
\]

Finally, we introduce a last metric called speed score. To compare speed
performance over datasets, the running time is normalized between [0,1] as
follows in equation (26):

\[
\mathrm{speed\_score}^p = \frac{1}{\#subsets} \sum_{i=1}^{\#subsets}
\frac{\overline{\mathrm{time}_i^p}}{\mathrm{max\_time}_i} \tag{26}
\]

where $\mathrm{max\_time}_i = \max_{p \in \mathcal{P}} \overline{\mathrm{time}_i^p}$
is the largest mean time obtained on subset $i$ over all the methods.

These evaluations were run on datasets GREC, PROT, MUTA and ILPISO.
In order to show the impact of the graph size on the problem complexity, we
performed our experiment on subsets where all graphs have the same number
of vertices.

5.3.3 Experimental settings 

To ease the understanding of these tests, we first recall some notations.
Graph edit distance relies on meta parameters, which are domain-dependent
costs. We borrow notations from Kaspar Riesen's thesis [42]. $\tau_{node}$
corresponds to the cost of a node deletion or insertion, $\tau_{edge}$
corresponds to the cost of an edge deletion or insertion, and
$\alpha \in [0,1]$ corresponds to the weighting parameter that controls whether
the edit operation cost on the nodes or on the edges is more important.
Elementary operation costs are reported in Table 5.

In this practical work, the BP method was provided by the Institute of Computer
Science and Applied Mathematics of Bern in Switzerland³, while the other
methods were re-implemented by us from the literature. All methods are
implemented in Java 1.7 except for the F1 and F2 models, which are implemented
in C# using CPLEX Concert Technology. CPLEX 12.6 was chosen since it is known
to be one of the best mathematical programming solvers. All the methods were
run on a 2.6 GHz quad-core computer with 8 GB RAM. For the sake of comparison,
none of the methods were parallelized and CPLEX was set up in a deterministic
manner.

5.4 Results 

In this section, we present the results obtained from the experiments. 

In Figure 1, the mean deviations of exact methods and approximate methods
are presented. Note that the A* method was only computed on the GREC dataset
due to its intractable time complexity. A* experiments could not be
conducted for graphs larger than 15 vertices with a memory constraint of 1 GB.
From Figure 1, several conclusions can be drawn. On all datasets, formulation
F2 outperforms formulation F1 in terms of accuracy. The gap between both
methods can reach 20% on the MUTA dataset. Among the lower bounds, F2LP is
the most accurate. However, lower bound results are very data dependent. On
the GREC dataset, the error committed is less than 5% while on the MUTA dataset
errors can reach 30%. A straightforward remark is that GREC seems to be a
quite affordable dataset while MUTA is more challenging. On the GREC dataset,
solving F2LP leads to near-optimal solutions. In the linear programming
formulations, topological constraints of the models are easy to satisfy. The
vertex matching constraints (Eq. 4), which require one vertex of $G_1$ to be
matched to only one vertex of $G_2$, fall apart. Solving the continuous
relaxation leads to a multivalent matching. The quality of the solution is then
mainly supported by the objective function, which helps guide the exploration
of the search space. This strengthens the observation that attributes are
meaningful and play a more important role than the topology in GREC. Among the
methods from the literature, BS is the most accurate except on the ILPISO
dataset. This can be explained by the directed nature of the graphs involved.
Directed edges can be seen as more stringent constraints on the topology.
Topology may significantly impact the first branching decisions of the beam
search algorithm at the expense of attributes. The backtracking capability of
beam search is reduced by the truncation of the search space, which prevents
the method from getting back to better branches. A* is probably the worst
method: when graphs are larger than 10 vertices, its error becomes very high
(i.e. more than 30%). A* cannot converge to optimality because of memory
saturation. The list OPEN containing pending solutions to be expanded grows
exponentially with the graph size. The bipartite heuristic fails to prune the
search tree efficiently. To conclude on deviation, the bigger the graphs, the
higher the error made by all the methods. Approximate methods may work poorly
in cases where neighborhoods do not allow partial solutions to be easily
differentiated. Among all approximate methods, F2LP is the most accurate: on
average, it is 6% more accurate than the second best approximate method, which
is BS.

Figure 1: Mean deviations on (a) GREC, (b) MUTA, (c) PROT and (d) ILPISO

3 http://www.iam.unibe.ch/fki/

In Figure 2, the average time to compute a graph comparison is depicted for
each method. F2 is always faster than F1. On GREC, F2 can be 100 times faster
than F1. On the other hand, the bigger the graphs, the smaller the difference.
In fact, as graphs get bigger, more time is required to solve the problem.
When the graph size exceeds 30 vertices, the speed of both formulations tends
to be similar and the time limit is reached.

Figure 2: Mean computation times on (a) GREC, (b) MUTA, (c) PROT and (d) ILPISO

Figure 3: A synthesis on deviation and time complexity for (a) GREC, (b) MUTA,
(c) PROT, (d) ILPISO and (e) all datasets. The lowest deviation and speed
scores are the best. Sub-figure (e) is obtained by merging the GREC, PROT,
MUTA and ILPISO results.


However, in the meantime, F2 reaches a better solution. H and BP are by
far the fastest methods. The speed gap with formulation F2 can reach a factor
of 1000 on the MUTA set when graphs get larger. On the other hand, at the scale
of exact method speeds, H and BP provide comparable speed results. Finally, A*
is the slowest method due to its intensive use of dynamic memory allocation,
the best-first search and a misleading bipartite heuristic.

To sum up advantages and drawbacks, each method is projected on a
two-dimensional space ($\mathbb{R}^2$) by using the speed score and deviation
score features defined in equations (26) and (24). Speed and deviation are two
competing criteria to be minimized. Figure 3 illustrates the methods projected
in the speed-deviation space. The A* method can be categorized as a dominated
method since it does not outperform any other method on either the deviation
or the speed criterion. Methods

%      F1     F2     F1LP   F2LP   BP     BS     H
F1     0      3      8.3    4.6    29.5   12.6   39.7
F2     3      0      11.4   7.7    32.5   15.7   42.8
F1LP   8.3    11.4   0      3.7    21.1   4.3    31.4
F2LP   4.6    7.7    3.7    0      24.8   8      35.1
BP     29.5   32.5   21.1   24.8   0      16.8   10.3
BS     12.6   15.7   4.3    8      16.8   0      27.1
H      39.7   42.8   31.4   35.1   10.3   27.1   0

Table 6: Mean deviation gap between methods in percentage over all the
databases.


/      F1       F2       F1LP   F2LP    BP     BS       H
F1     1        0.86     0.28   0.16    0      0.18     0
F2     1.16     1        0.33   0.18    0      0.2      0
F1LP   3.55     3.06     1      0.55    0      0.62     0
F2LP   6.44     5.54     1.81   1       0      1.13     0
BP     4648.3   4002     1309   721.8   1      817.06   1.8
BS     5.69     4.9      1.6    0.88    0      1        0
H      2514.2   2164.7   708    390.4   0.54   441.9    1

Table 7: Mean time factor between methods over all the databases.


behave differently according to the datasets. There's no such thing as a free
lunch⁴: error-tolerant matching is an NP-hard problem and no method can
fit all problems. A quantitative analysis is proposed in Tables 6 and 7. In
Table 6, deviation gaps between methods are presented, while in Table 7 the
time ratios between methods are given. Generally speaking, mathematical
models seem to be quite accurate and in this respect outperform other methods
from the literature. F2 outperforms the other methods on all datasets in terms
of accuracy. Among approximate methods, F2LP is the most precise heuristic.
The gap between F2 and F2LP is about 7%. The most challenging conventional
method is BS, which is 5 times faster than F2; however, on average over all the
datasets, F2 is 15% more accurate than BS. On the other hand, BP and H are the
fastest methods and any instance can be solved in less than three seconds.
Among approximate methods, F2LP is 8% more accurate than BS on average; the
downside is an extra computation time of 13% on average.
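The projection of the methods onto the speed-deviation plane lends itself to a standard Pareto dominance test, where both criteria are minimized. The sketch below uses the usual definition of dominance, which is not necessarily the exact criterion applied to produce Figure 3, and the scores are hypothetical placeholders rather than measured values.

```python
def dominates(a, b):
    """True if point a is no worse than b on every criterion and strictly
    better on at least one (both criteria are minimized)."""
    return (all(u <= v for u, v in zip(a, b))
            and any(u < v for u, v in zip(a, b)))


def pareto_front(points):
    """Names of the non-dominated methods.

    points: dict method name -> (speed_score, deviation_score).
    """
    return sorted(p for p, a in points.items()
                  if not any(dominates(b, a) for q, b in points.items() if q != p))


# Hypothetical (speed score, deviation score) pairs, for illustration only.
points = {
    "A*": (1.0, 0.9),
    "F2": (0.8, 0.0),
    "BS": (0.3, 0.4),
    "BP": (0.01, 0.8),
}
front = pareto_front(points)
print(front)  # ['BP', 'BS', 'F2']
```

With these placeholder values, A* is dominated while the three other methods each represent a different trade-off between speed and accuracy.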

4 A quote from the economist Milton Friedman

6 Conclusion

In this paper, two exact binary linear programming formulations of the graph
edit distance problem have been presented. Both formulations can deal with a
wide range of attributed relational graphs: directed or undirected graphs,
simple graphs or multigraphs, with a combination of symbolic, numeric and/or
string attributes on vertices and edges. The first formulation (F1) is a
didactic expression of the GED problem, while (F2) is a more refined program
where variables and constraints have been condensed to reduce the search space.
From both exact models, two lower bounds (F1LP) and (F2LP) have been derived by
continuous relaxation of binary variables. Models were solved by the CPLEX
solver, based on branch-and-cut techniques and the interior point method.
Formulations were evaluated on four publicly available databases. In all cases,
(F1) and (F1LP) are slower and less accurate than (F2) and (F2LP) respectively.
This result validates (F2) and the choice of reducing the number of variables
and constraints. (F2) is 15% more accurate on average than the best method from
the literature. Among approximate methods, (F2LP) is 8% more accurate than
beam search (BS) on average; the downside is an extra computation time of 13%
on average. To sum up, the choice of a method to solve a problem is a trade-off
between speed and accuracy. As future work, since quadratic programming
solvers are getting more and more efficient, we want to investigate the
definition of binary quadratic programming formulations of the graph edit
distance problem. Finally, another interesting direction will be to use lower
and upper bounds to build an optimized nearest neighbour search.


Appendices 


A Extension to undirected graphs 

Suppose that $G_1$ and $G_2$ are undirected graphs, i.e. their edges have no
orientation. The notations $ij$ and $ji$ refer to the same edge of $E_1$, and
so do $kl$ and $lk$ in $E_2$. This new assumption leads us to revise the
constraints of the formulation given for directed graphs.

Considering that (F2) has been shown to be more effective than (F1), we only
give (F2u), the formulation dedicated to computing the graph edit distance
between undirected graphs, adapted from (F2). The modification consists in
rewriting the sets of constraints (16) and (17) into (27d). Indeed, given an
edge $ij \in E_1$ and a vertex $k \in V_2$, there is at most one edge incident
to $k$ that can be matched to $ij$. Moreover, $x_{i,k}$ and $x_{j,k}$ cannot be
simultaneously equal to 1, so the sum $x_{i,k} + x_{j,k}$ is at most equal to 1.


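(F2u)

\begin{align}
\min_{x,y}\ \Bigl( & \sum_{i \in V_1} \sum_{k \in V_2} \bigl( c(i \to k) - c(i \to \epsilon) - c(\epsilon \to k) \bigr) \cdot x_{i,k} \notag \\
& + \sum_{ij \in E_1} \sum_{kl \in E_2} \bigl( c(ij \to kl) - c(ij \to \epsilon) - c(\epsilon \to kl) \bigr) \cdot y_{ij,kl} + C \Bigr) \tag{27a} \\
\text{subject to}\quad & \sum_{k \in V_2} x_{i,k} \le 1 && \forall i \in V_1 \tag{27b} \\
& \sum_{i \in V_1} x_{i,k} \le 1 && \forall k \in V_2 \tag{27c} \\
& \sum_{kl \in E_2} y_{ij,kl} \le x_{i,k} + x_{j,k} && \forall k \in V_2,\ \forall ij \in E_1 \tag{27d} \\
\text{with}\quad & x_{i,k} \in \{0,1\} && \forall (i,k) \in V_1 \times V_2 \tag{27e} \\
& y_{ij,kl} \in \{0,1\} && \forall (ij,kl) \in E_1 \times E_2 \tag{27f}
\end{align}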